The purpose of this analysis is to classify and understand the clients who decide by either Belkin, or Elago. The data Frame is composed of 10000 observations.
library(ggplot2)
library(knitr)
library(ggthemes)
library(magrittr)
library(dplyr)
library(funModeling)
#summary without zipcode
BelkinComplete <- read.csv("Data/BelkinComplete.csv", header = TRUE, sep = ",")
kable(summary(BelkinComplete), caption = "Summary")
| salary | age | elevel | car | zipcode | credit | brand | |
|---|---|---|---|---|---|---|---|
| Min. : 20000 | Min. :20.00 | Min. :1.000 | Min. : 1.00 | Min. :0.000 | Min. :416.6 | Belkin:4652 | |
| 1st Qu.: 52109 | 1st Qu.:35.00 | 1st Qu.:2.000 | 1st Qu.: 5.00 | 1st Qu.:2.000 | 1st Qu.:563.4 | Elago :5348 | |
| Median : 84969 | Median :50.00 | Median :2.000 | Median :10.00 | Median :4.000 | Median :632.1 | NA | |
| Mean : 84897 | Mean :49.81 | Mean :2.339 | Mean :10.47 | Mean :4.037 | Mean :632.1 | NA | |
| 3rd Qu.:117168 | 3rd Qu.:65.00 | 3rd Qu.:3.000 | 3rd Qu.:16.00 | 3rd Qu.:6.000 | 3rd Qu.:701.3 | NA | |
| Max. :150000 | Max. :80.00 | Max. :4.000 | Max. :20.00 | Max. :8.000 | Max. :849.0 | NA |
#doing way too much to be able to render str() as a table
data.frame(
variable = names(BelkinComplete),
classe = sapply(BelkinComplete, typeof),
first_values = sapply(BelkinComplete, function(x)
paste0(head(x), collapse = ", ")),
row.names = NULL) %>%
kable(caption = "Data Structure")
| variable | classe | first_values |
|---|---|---|
| salary | double | 123476.6418, 120274.5828, 121735.5422, 138276.8177, 126869.2463, 130595.0734 |
| age | integer | 23, 22, 26, 23, 23, 22 |
| elevel | integer | 4, 4, 4, 4, 4, 4 |
| car | integer | 1, 1, 1, 1, 1, 1 |
| zipcode | integer | 0, 2, 1, 0, 0, 7 |
| credit | double | 779.56, 784.7, 749.35, 743.85, 759.03, 774.13 |
| brand | integer | Elago, Elago, Elago, Elago, Elago, Elago |
#correcting data types
BelkinComplete$zipcode <- as.factor(BelkinComplete$zipcode)
BelkinComplete$elevel <- as.factor(BelkinComplete$elevel)
#checking the data a little deeper
kable(head(BelkinComplete, n = 10), caption = "Head of Data Frame: BelkinComplete")
| salary | age | elevel | car | zipcode | credit | brand |
|---|---|---|---|---|---|---|
| 123476.6 | 23 | 4 | 1 | 0 | 779.56 | Elago |
| 120274.6 | 22 | 4 | 1 | 2 | 784.70 | Elago |
| 121735.5 | 26 | 4 | 1 | 1 | 749.35 | Elago |
| 138276.8 | 23 | 4 | 1 | 0 | 743.85 | Elago |
| 126869.2 | 23 | 4 | 1 | 0 | 759.03 | Elago |
| 130595.1 | 22 | 4 | 1 | 7 | 774.13 | Elago |
| 121358.0 | 80 | 4 | 1 | 2 | 732.03 | Belkin |
| 127457.4 | 71 | 4 | 1 | 5 | 744.29 | Belkin |
| 137670.5 | 75 | 4 | 1 | 0 | 815.00 | Belkin |
| 120088.7 | 59 | 4 | 1 | 7 | 740.48 | Belkin |
kable(tail(BelkinComplete, n = 10), caption = "Tail of the Data Frame: BelkinComplete")
| salary | age | elevel | car | zipcode | credit | brand | |
|---|---|---|---|---|---|---|---|
| 9991 | 91704.72 | 39 | 3 | 15 | 2 | 613.97 | Elago |
| 9992 | 89303.87 | 51 | 3 | 9 | 2 | 619.06 | Elago |
| 9993 | 93765.05 | 29 | 2 | 2 | 7 | 678.69 | Elago |
| 9994 | 76441.30 | 24 | 3 | 3 | 3 | 630.89 | Elago |
| 9995 | 93055.21 | 45 | 3 | 3 | 8 | 585.84 | Elago |
| 9996 | 82700.89 | 52 | 3 | 3 | 6 | 682.34 | Elago |
| 9997 | 87488.02 | 29 | 2 | 12 | 8 | 597.31 | Elago |
| 9998 | 90905.84 | 58 | 2 | 19 | 2 | 708.59 | Elago |
| 9999 | 91315.24 | 47 | 3 | 16 | 4 | 703.44 | Elago |
| 10000 | 80023.00 | 31 | 2 | 3 | 5 | 679.04 | Elago |
#checking for missing data and unique values
kable(df_status(BelkinComplete))
variable q_zeros p_zeros q_na p_na q_inf p_inf type unique 1 salary 0 0.00 0 0 0 0 numeric 9757 2 age 0 0.00 0 0 0 0 integer 61 3 elevel 0 0.00 0 0 0 0 factor 4 4 car 0 0.00 0 0 0 0 integer 20 5 zipcode 1097 10.97 0 0 0 0 factor 9 6 credit 0 0.00 0 0 0 0 numeric 8651 7 brand 0 0.00 0 0 0 0 factor 2
| variable | q_zeros | p_zeros | q_na | p_na | q_inf | p_inf | type | unique |
|---|---|---|---|---|---|---|---|---|
| salary | 0 | 0.00 | 0 | 0 | 0 | 0 | numeric | 9757 |
| age | 0 | 0.00 | 0 | 0 | 0 | 0 | integer | 61 |
| elevel | 0 | 0.00 | 0 | 0 | 0 | 0 | factor | 4 |
| car | 0 | 0.00 | 0 | 0 | 0 | 0 | integer | 20 |
| zipcode | 1097 | 10.97 | 0 | 0 | 0 | 0 | factor | 9 |
| credit | 0 | 0.00 | 0 | 0 | 0 | 0 | numeric | 8651 |
| brand | 0 | 0.00 | 0 | 0 | 0 | 0 | factor | 2 |
ggplot(data = BelkinComplete, mapping = aes(age, salary)) +
geom_point(position = "jitter", alpha = 0.2) +
theme_tufte() +
ggtitle ("A) Age, Salary")
A) On this graph, it´s to be noticed how the distribution of the examples, occurs in an almost even way. Where at the last bracket there is a bigger concentration. It´s then relevant the thought of keeping the group as relevant for any further analysis.
ggplot(data = BelkinComplete, mapping = aes(age, zipcode)) +
geom_point(position = "jitter", alpha = 0.2) +
theme_tufte() +
ggtitle ("B) Age, Zip Code")
B) The age distribution in the different regions which the zip code defines, is also quite evenly distributed. This graph, along with the past graph, could indicate that the income level is relatively equally distributed between the regions and age borders.
ggplot(data = BelkinComplete, mapping = aes(age, car)) +
geom_point(position = "jitter", alpha = 0.2)+
theme_tufte()+
ggtitle ("C) Age, Car")
C) On the number of cars in relation with the age, there can´t be a concrete or relevant relation that can be stablished. There is a slight dominance on certain number of cars, along the entire spectrum of the age. Further studying might be needed.
ggplot(data = BelkinComplete, mapping = aes(age, elevel)) +
geom_point(position = "jitter", alpha = 0.2) +
theme_tufte() +
ggtitle ("D) Age, Education Level")
D) Contrary than what we saw before, there seems to be a strong difference between the age and the education level; the majority of the subjects in the sample, have reached an intermedium education level.
ggplot(data = BelkinComplete, mapping = aes(age, credit)) +
geom_point(position = "jitter", alpha = 0.2) +
theme_tufte() +
ggtitle ("E) Age, Credit")
E) Also the amount for the credit is concentrated in a medium scale for every age group.
ggplot(data = BelkinComplete, mapping = aes(car, credit)) +
geom_point(position = "jitter", alpha = 0.2) +
theme_tufte() +
ggtitle ("F) Car, Credit")
F) The credit amount tends to stay quite central, even when it´s shown it concentrates among certain quantity of cars heavier than in others. What might be relevant for further consideration is that the lower numbers for credit, are accumulated on the central region for cars. Around 8-14.
ggplot(data = BelkinComplete, mapping = aes(age, brand)) +
geom_point(position = "jitter", alpha = 0.2) +
theme_tufte() +
ggtitle ("G) Age, Brand")
G) The Previous graph shows how the brand preference for Belkin is stronger within the older population, while this reverses in the younger parts. On the sampling who chose Elago, this seems to be equally distributed, while on the brand who chose Belkin, the majority of the observations are concentrated on the last age bracket.
ggplot(data = BelkinComplete, mapping = aes(zipcode, salary)) +
geom_point(position = "jitter", alpha = 0.2) +
theme_tufte() +
ggtitle ("H) Zip Code, Salary")
H) While the geographical distribution shows that there is a clear majority of population in the third bracket, there seems to be no great differentiator on the distribution of the income per geographical area.
ggplot(data = BelkinComplete, mapping = aes(zipcode, credit)) +
geom_point(position = "jitter", alpha = 0.2) +
theme_tufte() +
ggtitle ("I) Zip Code, Credit")
I) There is a clear tendency on the amount of credit per region to be on the center amounts; As seen also with the age.
ggplot(data = BelkinComplete, mapping = aes(elevel, credit)) +
geom_point(position = "jitter", alpha = 0.2) +
theme_tufte() +
ggtitle ("J) Education Level, Credit")
J) Contrary to what was observed on the distribution of credit with the previous two graphs. Where it was correlated to the geographical area and the age; it has a clear tendency in this case to be concentrated among the groups with a median education level.
ggplot(data = BelkinComplete, mapping = aes(car, salary)) +
geom_point(position = "jitter", alpha = 0.2) +
theme_tufte() +
ggtitle ("K) Car, Salary")
K) Given the lack of smoothness on the distribution of the observations, and a counterintuitive grouping of such, it is possible to say that the information for the ‘car’ feature might not be correct.
ggplot(data = BelkinComplete,
mapping = aes(x = elevel, y = salary, color = brand)) +
geom_point(position = "jitter", alpha = 0.2) +
theme_tufte() +
ggtitle ("L) Education Level, Salary, Brand")
L) The education level and the salary show to have a clear, although moderate, impact on the choice of brand. The majority of the observations that point to Belkin, can be seen among middle income with a 2nd and 3rd educational level. While those with a 1st and 2nd education level with lower income, and 3rd level with higher income have a clear preference on Elago. The last portion of observations, with the highest education level and income, appear to also have the tendency to pick Elago.
ggplot(data = BelkinComplete,
mapping = aes(x = zipcode, y = age, color = brand)) +
geom_point(position = "jitter", alpha = 0.2) +
theme_tufte() +
ggtitle ("M) Zip Code, Age, Brand")
M) The observations are very discriminated by the age, yet almost not affected by the geographical area. The majority of the preference for Belkin is concentrated on the population with the oldest ages.
The salary doesn´t change with age (graph A), and most part of the observations can be found on the brackets [51-81). The observations are evenly distributed among the geographical areas (Graph B), the credit amount concentrates on the middle values for every age bracket (Graph E), different than the education level, which strongly concentrates on the middle, 2 and 3, for every age group (Graph D). There was not strong correlation founded on the features age-car (Graph C). There are empty spaces on the lower values of credit when observed with the feature car (Graph F), particularly on the values 2, 9, 11, 13, 14 for the feature car. There is an obvious preference in the first two age brackets for Elago, while on the last the observations accumulate on Belkin (Graph G). A weak relation was found between zipcode and salary (Graph H). Credit amount was grouped in the center values in relation to the geographical area (Graph I). Credit scores are concentrated on the middle education levels, 2 and 4, and there is a lot of growth opportunity on level 4, especially in the lower range (Graph J). A car-salary correlation has proven to not be very reliable (Graph K). People with a medium income and educational level, tend to choose Belkin as their brand preference (Graph L). Zip code doesn´t affect the brand preference, but age does have a relevant impact.
#checking variables correlation (spoiler alert, brand will be 1)
kable(correlation_table(data = BelkinComplete, target = "brand"))
| Variable | brand |
|---|---|
| brand | 1.00 |
| car | 0.02 |
| salary | -0.02 |
| credit | -0.02 |
| age | -0.35 |
#defining variable importance
variable_importance <- var_rank_info(data = BelkinComplete, target = "brand")
kable(variable_importance)
| var | en | mi | ig | gr |
|---|---|---|---|---|
| salary | 13.138 | 0.976 | 0.9763582 | 0.0744305 |
| credit | 13.141 | 0.861 | 0.8613010 | 0.0662230 |
| age | 6.803 | 0.113 | 0.1130479 | 0.0190988 |
| elevel | 2.698 | 0.011 | 0.0109680 | 0.0064033 |
| car | 5.232 | 0.009 | 0.0093611 | 0.0022053 |
| zipcode | 4.166 | 0.000 | 0.0002423 | 0.0000764 |
ggplot(variable_importance, aes(x = reorder(var, gr), y = gr, fill = var)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_bw() +
xlab("") +
ylab("Variable Importance
(based on Information Gain)") +
guides(fill = FALSE)
#proving how correlated salary and credit are
cor(BelkinComplete$salary, BelkinComplete$credit)
[1] 0.8700001